OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

The Impact of OCR Accuracy on Automatic Text Classification

Internal identifier: 001476 (Main/Exploration); previous: 001475; next: 001477

Authors: Guowei Zu [Japan]; Mayo Murata [Japan]; Wataru Ohyama [Japan]; Tetsushi Wakabayashi [Japan]; Fumitaka Kimura [Japan]

Source: Lecture Notes in Computer Science, 2004 (ISSN 0302-9743)

RBID : ISTEX:6A65353E3FD60DFE635CA2ED9D1812109AA85327

Abstract

The current general approach to digitizing paper media is to convert it into digital images with a scanner and then read the images with OCR to generate ASCII text for full-text retrieval. However, present OCR technology cannot recognize all characters with 100% accuracy. It is therefore important to know the impact of OCR accuracy on automatic text classification in order to assess its technical feasibility. In this research we perform automatic text classification experiments on English newswire articles to study the relationship between OCR accuracy and text classification accuracy, using statistical classification techniques.
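
The abstract relates character-level OCR accuracy to the quality of the text a classifier receives. The Python sketch below illustrates one way to explore that relationship. It is not the authors' procedure: the noise model (independent random letter substitutions), the function names `degrade` and `word_survival`, and the sample sentence are all assumptions made for illustration.

```python
import random
import string


def degrade(text, char_accuracy, rng):
    """Simulate OCR errors (assumed noise model, not from the paper).

    Each alphabetic character is kept with probability `char_accuracy`
    and otherwise replaced by a different random lowercase ASCII letter.
    Non-alphabetic characters (spaces, digits, punctuation) are kept.
    """
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() >= char_accuracy:
            choices = [c for c in string.ascii_lowercase if c != ch.lower()]
            out.append(rng.choice(choices))
        else:
            out.append(ch)
    return "".join(out)


def word_survival(clean, noisy):
    """Fraction of word positions that are identical in both texts."""
    clean_words = clean.split()
    noisy_words = noisy.split()
    same = sum(c == n for c, n in zip(clean_words, noisy_words))
    return same / len(clean_words)


# Made-up sample sentence, used only to exercise the functions.
sample = "stocks rose as the central bank cut interest rates"
rng = random.Random(0)
for accuracy in (1.0, 0.95, 0.80):
    noisy = degrade(sample, accuracy, rng)
    print(accuracy, round(word_survival(sample, noisy), 2), noisy)
```

Under this noise model a word of n letters survives intact with probability equal to the character accuracy raised to the power n, so word-level features degrade faster than the character accuracy alone suggests.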

Url: https://api.istex.fr/document/6A65353E3FD60DFE635CA2ED9D1812109AA85327/fulltext/pdf
DOI: 10.1007/978-3-540-30483-8_49


The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The Impact of OCR Accuracy on Automatic Text Classification</title>
<author>
<name sortKey="Zu, Guowei" sort="Zu, Guowei" uniqKey="Zu G" first="Guowei" last="Zu">Guowei Zu</name>
</author>
<author>
<name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
</author>
<author>
<name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
</author>
<author>
<name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
</author>
<author>
<name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:6A65353E3FD60DFE635CA2ED9D1812109AA85327</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-30483-8_49</idno>
<idno type="url">https://api.istex.fr/document/6A65353E3FD60DFE635CA2ED9D1812109AA85327/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000A37</idno>
<idno type="wicri:Area/Istex/Curation">000A25</idno>
<idno type="wicri:Area/Istex/Checkpoint">000D11</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Zu G:the:impact:of</idno>
<idno type="wicri:Area/Main/Merge">001527</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:05-0037779</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000495</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000294</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000452</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Zu G:the:impact:of</idno>
<idno type="wicri:Area/Main/Merge">001663</idno>
<idno type="wicri:Area/Main/Curation">001476</idno>
<idno type="wicri:Area/Main/Exploration">001476</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">The Impact of OCR Accuracy on Automatic Text Classification</title>
<author>
<name sortKey="Zu, Guowei" sort="Zu, Guowei" uniqKey="Zu G" first="Guowei" last="Zu">Guowei Zu</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1515 Kamihama-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="3">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Toshiba Solutions Corporation, Systems Integration Technology Center, Toshiba Building, 1-1, Shibaura 1-chome, Minato-ku, 105-6691, Tokyo</wicri:regionArea>
<placeName>
<settlement type="city">Tokyo</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1515 Kamihama-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1515 Kamihama-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1515 Kamihama-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1515 Kamihama-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2004</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">6A65353E3FD60DFE635CA2ED9D1812109AA85327</idno>
<idno type="DOI">10.1007/978-3-540-30483-8_49</idno>
<idno type="ChapterID">49</idno>
<idno type="ChapterID">Chap49</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Automatic classification</term>
<term>Character recognition</term>
<term>Content analysis</term>
<term>Content management</term>
<term>Digital image</term>
<term>Digitizing</term>
<term>Feasibility</term>
<term>Full text</term>
<term>Image scanners</term>
<term>Information retrieval</term>
<term>Information system</term>
<term>Optical character recognition</term>
<term>Probabilistic approach</term>
<term>Statistical analysis</term>
<term>World wide web</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Analyse contenu</term>
<term>Analyse statistique</term>
<term>Approche probabiliste</term>
<term>Classification automatique</term>
<term>Faisabilité</term>
<term>Gestion contenu</term>
<term>Image numérique</term>
<term>Numérisation</term>
<term>Recherche information</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Réseau web</term>
<term>Scanneur image</term>
<term>Système information</term>
<term>Texte intégral</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Numérisation</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Current general digitization approach of paper media is converting them into the digital images by a scanner, and then reading them by an OCR to generate ASCII text for full-text retrieval. However, it is impossible to recognize all characters with 100% accuracy by the present OCR technology. Therefore, it is important to know the impact of OCR accuracy on automatic text classification to reveal its technical feasibility. In this research we perform automatic text classification experiments for English newswire articles to study on the relationships between the accuracies of OCR and the text classification employing the statistical classification techniques.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Japon</li>
</country>
<settlement>
<li>Tokyo</li>
</settlement>
</list>
<tree>
<country name="Japon">
<noRegion>
<name sortKey="Zu, Guowei" sort="Zu, Guowei" uniqKey="Zu G" first="Guowei" last="Zu">Guowei Zu</name>
</noRegion>
<name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
<name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
<name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
<name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
<name sortKey="Zu, Guowei" sort="Zu, Guowei" uniqKey="Zu G" first="Guowei" last="Zu">Guowei Zu</name>
</country>
</tree>
</affiliations>
</record>
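
The record above can be read programmatically. The following Python sketch extracts the title, authors and DOI from a shortened excerpt of it using the standard library. As displayed, the record uses the `wicri:` prefix without a namespace declaration, which a strict XML parser rejects, so the sketch wraps the text in an element that binds the prefix. The namespace URI `urn:example:wicri`, the wrapper element and the function name `parse_record` are placeholders assumed for illustration, not part of Wicri or Dilib.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the record above (two of the five authors kept).
RECORD = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The Impact of OCR Accuracy on Automatic Text Classification</title>
<author>
<name sortKey="Zu, Guowei" sort="Zu, Guowei" uniqKey="Zu G" first="Guowei" last="Zu">Guowei Zu</name>
</author>
<author>
<name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="doi">10.1007/978-3-540-30483-8_49</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

# Assumption: placeholder URI, used only so the parser accepts the prefix.
WICRI_NS = "urn:example:wicri"


def parse_record(xml_text):
    """Return title, author names and DOI from a record like the one above."""
    # Bind the undeclared `wicri:` prefix on a wrapper element.
    wrapped = f'<wrapper xmlns:wicri="{WICRI_NS}">{xml_text}</wrapper>'
    root = ET.fromstring(wrapped)
    title_stmt = root.find(".//titleStmt")
    title = title_stmt.findtext("title")
    authors = [name.text for name in title_stmt.findall("author/name")]
    doi = root.findtext(".//publicationStmt/idno[@type='doi']")
    return {"title": title, "authors": authors, "doi": doi}


info = parse_record(RECORD)
print(info["title"])
print(info["authors"])
print(info["doi"])
```

Restricting the author search to `titleStmt` avoids counting each author twice, since the full record repeats the author list under `sourceDesc` and again under `affiliations`.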

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001476 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001476 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:6A65353E3FD60DFE635CA2ED9D1812109AA85327
   |texte=   The Impact of OCR Accuracy on Automatic Text Classification
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024